ABSTRACT

The problem of feature selection has raised considerable interest in the past decade. Traditional unsupervised methods select the features which can faithfully preserve the intrinsic structures of data, where the intrinsic structures are estimated using all the input features of data. However, the estimated intrinsic structures are unreliable or inaccurate when the redundant and noisy features are not removed. Therefore, we face a dilemma here: one needs the true structures of data to identify the informative features, and one needs the informative features to accurately estimate the true structures of data. To address this, we propose a unified learning framework which performs structure learning and feature selection simultaneously. The structures are adaptively learned from the results of feature selection, and the informative features are reselected to preserve the refined structures of data. By leveraging the interactions between these two essential tasks, we are able to capture accurate structures and select more informative features. Experimental results on many benchmark data sets demonstrate that the proposed method outperforms many state-of-the-art unsupervised feature selection methods.


1. INTRODUCTION 

Real-world applications usually involve big data with high dimensionality, presenting great challenges such as the curse of dimensionality and huge computation and storage costs. To tackle these difficulties, feature selection techniques are developed to keep a few relevant and informative features. According to the availability of label information, these algorithms can be categorized into supervised [21], semi-supervised [33, 27], and unsupervised algorithms. Compared to their supervised or semi-supervised counterparts, unsupervised feature selection is generally more challenging due to the lack of supervised information to guide the search for relevant features.

Unsupervised feature selection has attracted much attention in recent years, and a number of algorithms have been proposed. Without class labels, unsupervised feature selection chooses features that can effectively reveal or maintain the underlying structure of data. Recent research on feature selection and dimension reduction has shown that several important structures should be preserved by the selected features. These important structures include, but are not limited to, the global structure [36, 17], the local manifold structure [10, 11], and the discriminative information [28, 15]. These structures can be captured by widely used graph-based models, such as the sample pairwise similarity graph, the k-nn graph, the global integration of local discriminant models [28, 31], and the local linear embedding (LLE) [17].

Clearly, many existing unsupervised feature selection methods rely on characterizing the structure through some kind of graph, which is computed in the original feature space. Once the graph is determined, it stays fixed in the subsequent procedures, such as sparse spectral regression, that guide the search for informative features. Consequently, the performance of feature selection is largely determined by the effectiveness of graph construction. Ideally, such graphs should be constructed using only the informative feature subset rather than all candidate features. Unfortunately, the desired subset of features is unknown in advance, and irrelevant or noisy features are inevitably introduced in many real applications. As a result, unrelated or noisy features have an adverse effect on the characterization of the structures and hence hurt the subsequent feature selection performance.

In the unsupervised scenario, this is a chicken-and-egg problem between structure characterization and feature selection. Facing this dilemma, we propose to perform feature selection and structure learning in a unified framework, where each subtask is iteratively boosted by the result of the other. Concretely, the global structure of data is captured within the sparse representation framework, where the reconstruction coefficients are learned from the selected features. The local manifold structure is revealed by a probabilistic neighborhood graph, where the pairwise relationship is also determined by the selected features. When the global and local structures are given in the form of graph Laplacians, we seek the relevant features via sparse spectral regression with the help of graph embedding for cluster analysis. In this way, both the global and local structures of data can be better captured by using only the selected features; moreover, with the refined characterization of the structure, a better search for the informative features can also be expected.
Figure 1: An illustration of unsupervised filter methods and four types of embedded methods.


It is worthwhile to highlight several aspects of the proposed approach here:

1. Based on the different learning paradigms for unsupervised feature selection, we investigate most of the existing unsupervised embedded methods and further classify them into four closely related but different types. These analyses provide more insight into what should be further emphasized in the development of more essential unsupervised feature selection algorithms.

2. We propose a novel unified learning framework, called unsupervised Feature Selection with Adaptive Structure Learning (FSASL in short), to fill the gap between two essential subtasks, i.e., structure learning and feature learning. In this way, these two tasks can be mutually improved.

3. Comprehensive experiments on benchmark data sets show that our method achieves statistically significant improvement over state-of-the-art feature selection methods, suggesting the effectiveness of the proposed method.

The rest of this paper is arranged as follows. We review the related work in Section 2. Then we present our proposed formulation and the developed solution in Section 3. Discussions are introduced in Section 4. Extensive experiments are conducted and analyzed in Section 5. Section 6 concludes this paper with future work.

2. RELATED WORKS 

In this section, we review most existing unsupervised feature selection methods, i.e., filter and embedded methods. Unsupervised filter methods pick the features one by one based on certain evaluation criteria, where no learning algorithm is involved. The typical methods include: max variance (MaxVar), Laplacian score (LapScore), spectral feature selection (SPEC), and feature selection via eigenvalue sensitive criterion (EVSC). A common limitation of these approaches is that the correlation among features is neglected.

Unsupervised embedded approaches are developed to perform feature selection and fit a learning model simultaneously. Based on the different sub-steps involved in the feature selection procedure, these embedded methods can be further divided into four different types, as illustrated in Figure 1.

The first type of embedded methods first detects the structure of the data and then directly selects those features that best preserve the detected structure. The typical methods include: trace ratio (TraceRatio) [20] and unsupervised discriminative feature selection (UDFS) [28]. TraceRatio is prone to selecting redundant features [17], and the learning model of UDFS is often too restrictive [22].

The second type of embedded methods first constructs various graph Laplacians to capture the data structure, then uncovers the cluster structure via graph embedding, and finally uses sparse spectral regression to select those features that are best aligned with the embedding. Instead of directly selecting features as the first type does, these approaches resort to an intermediate cluster analysis sub-step to reveal the cluster structure that guides the selection of features. The cluster structure discovered by either the graph spectral embedding or another clustering module can be seen as an approximation of the unseen labels. The typical methods include: multi-cluster feature selection (MCFS), minimum redundancy spectral feature selection (MRSF), similarity preserving feature selection (SPFS) [36], joint feature selection and subspace learning (FSSL), and global and local structure preserving feature selection (GLSPFS) [17].

Unlike the second type, the cluster analysis in the third type of embedded methods is co-determined by the embedding of the graph Laplacian and the adaptive discriminative regularization, which can be obtained from the result of sparse spectral regression. By using the feedback from feature selection, the whole learning procedure can provide better cluster analysis, and vice versa. The typical methods include: joint embedding learning and spectral regression (JELSR) [12, 11], nonnegative discriminative feature selection (NDFS) [15], robust unsupervised feature selection (RUFS), and feature selection via clustering-guided sparse structural learning (CGSSL) [14].

The fourth type of embedded methods tries to feed the result of feature selection into the structure learning procedure in order to improve the quality of structure learning. In [32], a feature selection method is proposed for local learning-based clustering (LLCFS), which incorporates the relevance of each feature into the built-in regularization of the local learning model, where the induced graph Laplacian can be iteratively updated. However, LLCFS uses the discrete k-nearest neighbor graph and does not optimize the same objective function in structure learning and feature search.

It can be seen that all the above methods (except LLCFS) share a common drawback: they use all features to estimate the underlying structures of data. Since redundant and noisy features are unavoidable in real-world applications (which is also why we need feature selection), the structures learned using all features will also be contaminated, which degrades the performance of feature selection. By leveraging the coherent interactions between structure learning and feature selection, our proposed method FSASL seamlessly integrates them into a unified framework, where the result of one task is used to improve the other.


3. UNSUPERVISED FEATURE SELECTION WITH ADAPTIVE STRUCTURE LEARNING

Let $X = \{x_1, \ldots, x_n\} \in \mathbb{R}^{d \times n}$ denote the data matrix, whose columns correspond to data instances and rows to features. The generic problem of unsupervised feature selection is to find the most informative features. In the absence of class labels to guide the search for relevant features, the data represented with the selected features should preserve the intrinsic structure as well as the data represented by all the original features.

To achieve this goal, we propose to perform unsupervised feature selection and data structure learning simultaneously, where both the global and local structures are adaptively updated using the result of the current feature selection.

We first summarize the notation used throughout this paper. We use bold uppercase characters to denote matrices and bold lowercase characters to denote vectors. For an arbitrary matrix $A \in \mathbb{R}^{r \times t}$, $a_i$ denotes the i-th column vector of $A$, $a^j$ denotes the j-th row vector of $A$, and $A_{ij}$ denotes the (i,j)-th entry of $A$. The $\ell_{2,1}$-norm is defined as $\|A\|_{2,1} = \sum_{i=1}^{r} \sqrt{\sum_{j=1}^{t} A_{ij}^2}$.
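To make the notation concrete, the following minimal sketch computes the row norms and the $\ell_{2,1}$-norm of a small matrix. The function names `row_norms` and `l21_norm` and the list-of-rows representation are illustrative assumptions, not part of the original formulation.

```python
import math

def row_norms(A):
    """Euclidean norm of every row of A (A is a list of rows)."""
    return [math.sqrt(sum(v * v for v in row)) for row in A]

def l21_norm(A):
    """Sum of the row-wise Euclidean norms, i.e. ||A||_{2,1}."""
    return sum(row_norms(A))

# Toy 3 x 2 matrix; the all-zero second row contributes nothing to the norm.
A = [[3.0, 4.0],
     [0.0, 0.0],
     [5.0, 12.0]]
norms = row_norms(A)
total = l21_norm(A)
print(norms, total)
```

Because the norm sums unsquared row norms, penalizing it tends to drive entire rows to zero, which is why it is used later as the regularizer on the feature selection matrix.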


3.1 Adaptive Global Structure Learning

Over the past decades, a large number of algorithms have been proposed based on the analysis of the global structure of data, such as Principal Component Analysis (PCA) and Maximum Variance (MaxVar). Recently, the global pairwise similarity (e.g., with a Gaussian kernel) between high-dimensional samples has demonstrated promising performance for unsupervised feature selection [36, 17]. However, such dense similarity becomes less discriminative for high-dimensional data, especially when there are many unfavorable features in the original high-dimensional space.

Inspired by the recent development of compressed sensing and sparse representation, we use the sparse reconstruction coefficients to extract the global structure of data. In sparse representation, each data sample $x_i$ can be approximated as a linear combination of all the other samples, and the optimal sparse combination weight matrix $S \in \mathbb{R}^{n \times n}$ can be obtained by solving the following problem:

$$\min_{S} \sum_{i=1}^{n} \left( \|x_i - X s_i\|^2 + \alpha \|s_i\|_1 \right) \quad \text{s.t. } S_{ii} = 0 \qquad (1)$$

where $\alpha$ balances the sparsity and the reconstruction error. Compared with the pairwise similarity, the sparse representation is naturally discriminative: among all the candidate samples, it selects the samples that most compactly express the target and rejects all other possible but less compact candidates [26].

Clearly, the selected features should preserve this global and sparse reconstruction structure. To achieve this, we introduce a row-sparse feature selection and transformation matrix $W \in \mathbb{R}^{d \times c}$ into the reconstruction process, and get

$$\min_{S, W} \sum_{i=1}^{n} \|W^T x_i - W^T X s_i\|^2 + \alpha \|S\|_1 + \gamma \|W\|_{2,1} \qquad (2)$$
$$\text{s.t. } S_{ii} = 0,\ W^T X X^T W = I$$

where $\gamma$ is a regularization parameter. Compared with Eq. (1), the benefits of Eq. (2) are twofold: 1) the global structure captured by $S$ can be used to guide the search for relevant features; 2) by largely eliminating the adverse effect of noisy and unfavorable features, the global structure can also be better estimated.


3.2 Adaptive Local Structure Learning

The importance of preserving the local manifold structure has been well recognized in the recent development of unsupervised feature selection algorithms, especially considering that high-dimensional data often presents a low-dimensional manifold structure. To detect the underlying local manifold structure, these algorithms usually first construct a k-nearest neighbor graph and then compute the graph Laplacian with different models. Clearly, both the k-nn graph and the graph Laplacian are determined by all the features, relevant and irrelevant alike. As a result, the manifold structure captured by such a graph Laplacian is inevitably affected by the redundant and noisy features. Moreover, iteratively updating a discrete neighborhood relationship using the result of feature selection still lacks a theoretical guarantee of convergence [32].

Instead of using the graph Laplacian with a deterministic neighborhood relationship, we propose to directly learn a probabilistic neighborhood matrix induced by the Euclidean distance. For each data sample $x_i$, all the data points $\{x_j\}_{j=1}^{n}$ are considered as the neighborhood of $x_i$ with probability $P_{ij}$, where $P \in \mathbb{R}^{n \times n}$ can be determined by solving the following problem:

$$\min_{P} \sum_{i,j}^{n} \left( \|x_i - x_j\|^2 P_{ij} + \mu P_{ij}^2 \right) \quad \text{s.t. } P 1_n = 1_n,\ P \ge 0 \qquad (3)$$

where $\mu$ is the regularization parameter. The regularization term is used to 1) avoid the trivial solution; 2) add a prior of uniform distribution. It can be seen that a large distance $\|x_i - x_j\|^2$ leads to a small probability $P_{ij}$. With this property, the estimated weight matrix $P$ and the induced Laplacian $L_P = D_P - (P + P^T)/2$ can be used for local manifold characterization, where $D_P$ is a diagonal matrix whose i-th diagonal element is $\sum_j (P_{ij} + P_{ji})/2$.
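As an illustration of the induced Laplacian $L_P = D_P - (P + P^T)/2$, the sketch below builds it from a small row-stochastic matrix. The helper name `induced_laplacian` and the toy matrix are assumptions made for this example only.

```python
def induced_laplacian(P):
    """Return L_P = D_P - (P + P^T) / 2 for a square matrix P (list of rows)."""
    n = len(P)
    # Symmetrized affinity (P + P^T) / 2.
    A = [[(P[i][j] + P[j][i]) / 2.0 for j in range(n)] for i in range(n)]
    # Degree of node i: the i-th row sum of the symmetrized affinity.
    deg = [sum(row) for row in A]
    return [[(deg[i] if i == j else 0.0) - A[i][j] for j in range(n)]
            for i in range(n)]

# Toy probabilistic neighborhood matrix: each row sums to one, zero diagonal.
P = [[0.0, 0.7, 0.3],
     [0.5, 0.0, 0.5],
     [0.2, 0.8, 0.0]]
L = induced_laplacian(P)
for row in L:
    print(row)
```

Although $P$ itself is generally asymmetric, the resulting $L_P$ is symmetric with zero row sums, as expected of a graph Laplacian.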

To leverage the result of feature selection and iteratively improve the probabilistic neighborhood relationship, we also introduce the feature selection and transformation matrix $W$ used in adaptive global structure learning, and we get

$$\min_{P, W} \sum_{i,j}^{n} \left( \|W^T x_i - W^T x_j\|^2 P_{ij} + \mu P_{ij}^2 \right) + \gamma \|W\|_{2,1} \qquad (4)$$
$$\text{s.t. } P 1_n = 1_n,\ P \ge 0,\ W^T X X^T W = I$$


With the sparsity on $W$, the irrelevant and noisy features can be largely removed, so we can learn a better probabilistic neighborhood graph for local structure characterization based on the result of feature selection, i.e., $W^T X$. Moreover, we seek those features that preserve the local structure encoded by $P$. Thus, the optimization problem in Eq. (4) performs feature selection and local structure learning simultaneously.

3.3 Unsupervised Feature Selection with Adaptive Structure Learning

Based on the two adaptive structure learning models presented in Eq. (2) and Eq. (4), we propose a novel unsupervised feature selection method by solving the following optimization problem:

$$\min_{W, S, P} \left( \|W^T X - W^T X S\|^2 + \alpha \|S\|_1 \right) + \beta \sum_{i,j}^{n} \left( \|W^T x_i - W^T x_j\|^2 P_{ij} + \mu P_{ij}^2 \right) + \gamma \|W\|_{2,1} \qquad (5)$$
$$\text{s.t. } S_{ii} = 0,\ P 1_n = 1_n,\ P \ge 0,\ W^T X X^T W = I$$

where $\beta$ and $\gamma$ are regularization parameters balancing the fitting error of global and local structure learning (the first and second groups) and the sparsity of the feature selection matrix (the third group).

It can be seen that when both $S$ and $P$ are given, our method selects those features that respect both the global and local structure of data. When the feature selection matrix $W$ is given, our method learns the global and local structure of data in a transformed space, i.e., $W^T X$, where the adverse effect of noisy features is largely alleviated by the sparse regularization. In this way, each of these two essential tasks is boosted by the other within a unified learning framework. Since both the global and local structure can be adaptively refined according to the result of feature selection, we call Eq. (5) unsupervised Feature Selection with Adaptive Structure Learning (FSASL).

3.4 Optimization Algorithm

Because the optimization problem in Eq. (5) comprises three different variables with different regularizations and constraints, it is hard to derive a closed-form solution directly. We therefore derive an alternating iterative algorithm, which converts the problem with several variables ($S$, $P$ and $W$) into a series of subproblems in each of which only one variable is involved.

First, when $W$ and $P$ are fixed, we need to solve n decoupled subproblems of the following form:

$$\min_{s_i} \|\tilde{x}_i - \tilde{X} s_i\|^2 + \alpha \|s_i\|_1 \quad \text{s.t. } S_{ii} = 0 \qquad (6)$$

where $\tilde{X} = W^T X$ is the transformed data obtained by projecting the relevant features into a low-dimensional space. The above LASSO problem can be efficiently solved by routine optimization tools, e.g., proximal methods [3, 16].

Next, when $W$ and $S$ are fixed, we need to solve n decoupled subproblems of the following form:

$$\min_{p^i} \sum_{j=1}^{n} \left( \|\tilde{x}_i - \tilde{x}_j\|^2 P_{ij} + \mu P_{ij}^2 \right) \quad \text{s.t. } 1_n^T p^i = 1,\ P_{ij} \ge 0 \qquad (7)$$

Let $A \in \mathbb{R}^{n \times n}$ be a square matrix with $A_{ij} = -\frac{1}{2\mu} \|\tilde{x}_i - \tilde{x}_j\|^2$; then the above problem can be written as

$$\min_{p^i} \frac{1}{2} \|p^i - a^i\|_2^2 \quad \text{s.t. } (p^i)^T 1_n = 1,\ 0 \le P_{ij} \le 1 \qquad (8)$$

where $p^i$ and $a^i$ are the i-th rows of $P$ and $A$, respectively. The above Euclidean projection of a vector onto the probability simplex can be efficiently solved by Algorithm 1 without iterations. More details can be found in Eq. (19).

Algorithm 1 The optimization algorithm of Eq. (8)

Input: $a$
sort $a$ into $b$ where $b_1 \ge b_2 \ge \ldots \ge b_n$
find $\rho = \max\{1 \le j \le n : b_j + \frac{1}{j}(1 - \sum_{i=1}^{j} b_i) > 0\}$
define $z = \frac{1}{\rho}(1 - \sum_{i=1}^{\rho} b_i)$
Output: $p$ with $p_j = \max\{a_j + z, 0\}$, $j = 1, \ldots, n$

Next, when $S$ and $P$ are fixed, we need to solve the following problem:

$$\min_{W} \|W^T X - W^T X S\|^2 + \beta \sum_{i,j}^{n} \|W^T x_i - W^T x_j\|^2 P_{ij} + \gamma \|W\|_{2,1} \quad \text{s.t. } W^T X X^T W = I \qquad (9)$$

Using $L_S = (I - S)(I - S)^T$ and $L_P = D_P - (P + P^T)/2$, and letting $L = L_S + \beta L_P$, the above problem can be rewritten as

$$\min_{W} \mathrm{Tr}(W^T X L X^T W) + \gamma \|W\|_{2,1} \quad \text{s.t. } W^T X X^T W = I \qquad (10)$$
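For readers who want to experiment with the update of $P$, Algorithm 1 (the Euclidean projection of a vector onto the probability simplex) can be sketched as follows. The function name `project_to_simplex` and the toy input are illustrative assumptions; in the setting of Eq. (8), the input vector would be the i-th row of $A$.

```python
def project_to_simplex(a):
    """Euclidean projection of a onto {p : sum(p) = 1, p >= 0}."""
    b = sorted(a, reverse=True)            # b_1 >= b_2 >= ... >= b_n
    cumulative = 0.0
    rho, rho_sum = 0, 0.0
    for j, bj in enumerate(b, start=1):
        cumulative += bj
        # Keep the largest j with b_j + (1 - sum_{i<=j} b_i) / j > 0.
        if bj + (1.0 - cumulative) / j > 0:
            rho, rho_sum = j, cumulative
    z = (1.0 - rho_sum) / rho              # shift shared by all coordinates
    return [max(aj + z, 0.0) for aj in a]

a = [0.5, 0.2, -1.0]
p = project_to_simplex(a)
print(p)
```

The condition always holds for j = 1, so `rho` is at least one and the division is safe. Apart from the sort, the procedure is a single pass over the vector, which is what "without iterations" refers to.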


Due to the non-smooth regularization, it is hard to obtain a closed-form solution, so we solve the problem iteratively. Given the t-th estimate $W^t$, and denoting by $D_{W^t}$ the diagonal matrix whose i-th diagonal element is $\frac{1}{2\|w_t^i\|_2}$ (with $w_t^i$ the i-th row of $W^t$), Eq. (10) can be rewritten as:

$$\min_{W} \mathrm{Tr}\left( W^T X (L + \gamma D_{W^t}) X^T W \right) \quad \text{s.t. } W^T X X^T W = I \qquad (11)$$


The optimal solution $W$ is given by the eigenvectors corresponding to the c smallest eigenvalues of the generalized eigen-problem:

$$X (L + \gamma D_{W^t}) X^T W = X X^T W \Lambda \qquad (12)$$

where $\Lambda$ is a diagonal matrix whose diagonal elements are the eigenvalues. To get a stable solution of this eigen-problem, the matrix $X X^T$ is required to be non-singular, which does not hold when the number of features is larger than the number of samples. Moreover, the computational complexity of this approach scales as $O(d^3 + n d^2)$, which is costly for high-dimensional data. Thus, such a solution is less attractive in real-world applications. To improve the effectiveness and the efficiency of optimizing Eq. (10), we further resort to a two-step procedure inspired by previous work.









Theorem 1. Let $Y \in \mathbb{R}^{n \times c}$ be a matrix of which each column is an eigenvector of the eigen-problem $L y = \lambda y$. If there exists a matrix $W \in \mathbb{R}^{d \times c}$ such that $X^T W = Y$, then each column of $W$ is an eigenvector of the generalized eigen-problem $X L X^T w = \lambda X X^T w$ with the same eigenvalue $\lambda$.

Proof. With $X^T w = y$, the following equation holds

$$X L X^T w = X L y = X \lambda y = \lambda X y = \lambda X X^T w \qquad (13)$$

Thus, $w$ is an eigenvector of the generalized eigen-problem $X L X^T w = \lambda X X^T w$ with the same eigenvalue $\lambda$. □

Theorem 1 shows that instead of solving the generalized eigen-problem in Eq. (12), $W$ can be obtained by the following two steps:

1. Solve the eigen-problem $L Y = Y \Lambda$ to get $Y$ corresponding to the c smallest eigenvalues;

2. Find $W$ which satisfies $X^T W = Y$. Since such a $W$ may not exist in real applications, we resort to solving the following optimization problem:

$$\min_{W} \|Y - X^T W\|^2 + \gamma \|W\|_{2,1} \qquad (14)$$

The optimal solution of Eq. (14) can also be obtained with routine optimization tools, such as the iterative re-weighted method and the proximal method [16].
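As a hedged illustration of the proximal route to Eq. (14), the sketch below implements only the proximal operator of the $\ell_{2,1}$-norm, i.e., row-wise shrinkage. A full solver would alternate a gradient step on $\|Y - X^T W\|^2$ with this operator; the step size, the name `prox_l21` and the toy values are assumptions of this example rather than details taken from the paper.

```python
import math

def prox_l21(V, threshold):
    """Row-wise shrinkage: the proximal operator of threshold * ||.||_{2,1} at V."""
    shrunk = []
    for row in V:
        norm = math.sqrt(sum(v * v for v in row))
        # Rows whose norm is at most the threshold are set to zero;
        # longer rows keep their direction and lose `threshold` in length.
        scale = max(0.0, 1.0 - threshold / norm) if norm > 0 else 0.0
        shrunk.append([scale * v for v in row])
    return shrunk

V = [[3.0, 4.0],   # norm 5.0  -> rescaled by 0.8
     [0.3, 0.4],   # norm 0.5  -> removed (below the threshold)
     [0.0, 0.0]]   # already zero
W = prox_l21(V, 1.0)
print(W)
```

Each row of $W$ corresponds to one feature, so a row that is shrunk to zero is a feature that drops out of the selection.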


The complete algorithm to solve FSASL is summarized in Algorithm 2.

Algorithm 2 The optimization algorithm of FSASL

Input: The data matrix $X \in \mathbb{R}^{d \times n}$, the regularization parameters $\alpha$, $\beta$, $\gamma$, $\mu$, the dimension of the transformed data c.
repeat
For each i, update the i-th column of $S$ by solving the problem in Eq. (6);
For each i, update the i-th row of $P$ using Algorithm 1;
Compute the overall graph Laplacian $L = L_S + \beta L_P$;
Compute $W$ by Eq. (12) or Eq. (14);
until convergence
Output: Sort all the d features according to $\|w^i\|_2$ ($i = 1, \ldots, d$) in descending order and select the top m ranked features.

3.5 Convergence Analysis

FSASL is solved in an alternating way, and the optimization procedure monotonically decreases the objective of the problem in Eq. (5) in each iteration. Since the objective function is bounded below (e.g., by zero), the above iteration converges. Besides, the experimental results show that it converges fast: the number of iterations is often less than 20.

3.6 The determination of parameter $\mu$

Since the parameter $\mu$ controls the trade-off between the trivial solution ($\mu = 0$) and the uniform distribution ($\mu = \infty$), we would like to keep only the top-k neighbors for local manifold structure characterization, as in the k-nn graph [19]. We provide an effective method to achieve this. For each subproblem in Eq. (8), the Lagrangian function is

$$\frac{1}{2} \|p^i - a^i\|_2^2 - \tau \left( (p^i)^T 1_n - 1 \right) - \eta_i^T p^i \qquad (15)$$

where $\tau$ and $\eta_i$ are the Lagrangian multipliers. According to the KKT conditions, the optimal value can be obtained by

$$P_{ij} = (A_{ij} + \tau)_+ \qquad (16)$$

By sorting each row of $A$ into $B$ in descending order (i.e., in ascending order of distance), the following inequalities hold when exactly k entries are non-zero:

$$B_{ik'} + \tau > 0 \ \text{for } k' = 1, \ldots, k; \qquad B_{ik'} + \tau \le 0 \ \text{for } k' = k+1, \ldots, n \qquad (17)$$

Considering the simplex constraint on $p^i$, we further get

$$\tau = \frac{1}{k} \left( 1 - \sum_{k'=1}^{k} B_{ik'} \right) \qquad (18)$$

By substituting Eq. (18) into Eq. (16), the optimal value of $P$ can be obtained by

$$P_{ij} = \left( A_{ij} + \frac{1}{k} \left( 1 - \sum_{k'=1}^{k} B_{ik'} \right) \right)_+ \qquad (19)$$

Since $B_{ik'} = -\frac{1}{2\mu} \|W^T x_i - W^T x_{k'}\|^2 = -\frac{1}{2\mu} d^w_{ik'}$, for each subproblem we have

$$\frac{k}{2} d^w_{ik} - \frac{1}{2} \sum_{k'=1}^{k} d^w_{ik'} < \mu \le \frac{k}{2} d^w_{i,k+1} - \frac{1}{2} \sum_{k'=1}^{k} d^w_{ik'} \qquad (20)$$

When $\mu$ satisfies the above inequality for the i-th example, the corresponding $p^i$ has k non-zero components. Therefore, the average number of non-zero elements in each row of $P$ is close to k when we set $\mu$ to the average over all examples of the upper bound in Eq. (20):

$$\mu = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{k}{2} d^w_{i,k+1} - \frac{1}{2} \sum_{k'=1}^{k} d^w_{ik'} \right) \qquad (21)$$

In this way, the search for the parameter $\mu$ is replaced by a search over the neighborhood size k, which is more intuitive and easier to tune.

4. DISCUSSION 

In this section, we discuss some approaches which are 
closely related to our method. 

Zeng and Cheung [32] proposed to integrate feature selection within the regularization of local learning-based clustering (LLCFS), which involves two sub-steps:

1. It constructs the k-nearest neighbor graph in the weighted feature space.

2. It performs joint clustering and feature weight learning by solving the following problem


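$$\min_{Y, \{W^i, b^i\}_{i=1}^{n}, z} \sum_{i=1}^{n} \sum_{c'=1}^{c} \Big[ \sum_{x_j \in \mathcal{N}_{x_i}} \beta \left( Y_{jc'} - x_j^T W^i_{c'} - b^i_{c'} \right)^2 + (W^i_{c'})^T \mathrm{diag}(z^{-1}) W^i_{c'} \Big] \qquad (22)$$
$$\text{s.t. } 1_d^T z = 1,\ z \ge 0$$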


where $z$ is the feature weight vector and $\mathcal{N}_{x_i}$ is the set of k-nearest neighbors of $x_i$ based on the $z$-weighted features.









Compared with LLCFS, FSASL performs both global and local structure learning in an adaptive manner, whereas LLCFS explores only the local structure. Moreover, LLCFS uses the discrete k-nearest neighbor graph and does not optimize the same objective function in structure learning and feature search, while FSASL is optimized within a unified framework with the probabilistic neighborhood relationship.

Hou et al. [12, 11] proposed the joint embedding learning and sparse regression (JELSR) method, which can be formulated as follows:

$$\min_{W,\, Y^T Y = I} \mathrm{tr}(Y^T L_2 Y) + \lambda_1 \left( \|Y - X^T W\|^2 + \lambda_2 \|W\|_{2,1} \right) \qquad (23)$$

Comparing the formulations in Eq. (5) and Eq. (23), the main differences between FSASL and JELSR include: 1) FSASL selects features that respect both the global and local manifold structure, while JELSR only incorporates the local manifold structure; 2) the local structure in JELSR is based on the k-nearest neighbor graph, while FSASL learns a probabilistic neighborhood graph, which can be easily refined according to the result of feature selection; 3) JELSR iteratively performs spectral embedding for clustering and sparse spectral regression for feature selection, as illustrated in Fig. 1. However, the local structure itself (i.e., $L_2$) is not changed during the iterations, whereas FSASL adaptively improves both the global and local structure characterization using the selected features.

Most recently, Liu et al. [17] proposed a global and local structure preservation framework for feature selection (GLSPFS). It first constructs the pairwise sample similarity matrix $K$ with a Gaussian kernel function to capture the global structure of data, then decomposes $K = Y Y^T$. Using $Y$ as the regression target, GLSPFS solves the following problem:

$$\min_{W} \|Y - X^T W\|^2 + \lambda_1 \mathrm{tr}(W^T X L_3 X^T W) + \lambda_2 \|W\|_{2,1} \qquad (24)$$

The main differences between FSASL and GLSPFS include: 1) GLSPFS uses the Gaussian kernel, while FSASL captures the global structure within sparse representation, which is more discriminative; 2) both the global and local structures (i.e., $K$ and $L_3$) in GLSPFS are based on all features, while FSASL refines these structures with the selected features.


5. EXPERIMENTS 

In this section, we conduct extensive experiments to evaluate the performance of the proposed FSASL for the task of unsupervised feature selection.

5.1 Data Sets 

The experiments are conducted on 8 publicly available datasets, including handwritten and spoken digit/letter recognition data sets (i.e., MFEA from the UCI repository and USPS49 [32], which is a two-class subset of USPS), three face image data sets (i.e., UMIST, JAFFE, AR), one object data set (i.e., COIL) and two biomedical data sets (i.e., LUNG [18], TOX). The details of these benchmark data sets are summarized in Table 3.

5.2 Experiment Setup 


Table 3: Summary of the benchmark data sets and the number of selected features

Data Sets   samples   features   classes   selected features
MFEA        2000      240        10        [5, 10, ..., 50]
USPS49      1673      256        2         [5, 10, ..., 50]
UMIST       575       644        20        [5, 10, ..., 50]
JAFFE       213       676        10        [5, 10, ..., 50]
AR          840       768        120       [5, 10, ..., 50]
COIL        1440      1024       20        [5, 10, ..., 50]
LUNG        203       3312       5         [10, 20, ..., 150]
TOX         171       5748       4         [10, 20, ..., 150]


To validate the effectiveness of our proposed FSASL^1, we compare it with one baseline (i.e., AllFea) and state-of-the-art unsupervised feature selection methods:

• LapScore^2, which evaluates the features according to their ability to preserve the locality of the data manifold structure.

• MCFS^3, which selects the features by adopting spectral regression with $\ell_1$-norm regularization.

• LLCFS [32], which incorporates the relevance of each feature into the built-in regularization of the local learning-based clustering algorithm.

• UDFS^4 [28], which exploits local discriminative information and feature correlations simultaneously.

• NDFS^5 [15], which selects features by a joint framework of nonnegative spectral analysis and $\ell_{2,1}$-norm regularized regression.

• SPFS^6 [36], which selects a feature subset with which the pairwise similarity between high-dimensional samples can be maximally preserved.

• RUFS^7, which performs robust clustering and robust feature selection simultaneously to select the most important and discriminative features.

• JELSR^8 [12, 11], which joins embedding learning with sparse regression to perform feature selection.

• GLSPFS^9 [17], which integrates both global pairwise sample similarity and local geometric data structure to conduct feature selection.


There are some parameters to be set in advance. For all the feature selection algorithms except SPFS, we set k = 5 for all the datasets to specify the size of neighborhoods.

^1 For the purpose of reproducibility, we provide the code at https://github.com/csliangdu/FSASL
^2 http://www.cad.zju.edu.cn/home/dengcai/Data/code/LaplacianScore.m
^3 http://www.cad.zju.edu.cn/home/dengcai/Data/code/MCFS_p.m
^4 http://www.cs.cmu.edu/~yiyang/UDFS.rar
^5 https://sites.google.com/site/zcliustc
^6 https://sites.google.com/site/alanzhao
^7 https://sites.google.com/site/qianmingjie
^8 http://www.escience.cn/people/chenpinghou
^9 We also use the implementation provided by the authors.






















Table 1: Aggregated clustering results measured by Accuracy (%) of the compared methods.

Data Sets  AllFea    LapScore  MCFS      LLCFS     UDFS      NDFS      SPFS      RUFS      JELSR     GLSPFS    FSASL
MFEA       68.73     51.78     51.04     60.38     64.94     67.13     68.20     64.58     67.01     61.00     69.94
                     ± 5.51    ± 8.13    ± 8.58    ± 3.32    ± 7.53    ± 9.43    ± 7.99    ± 8.37    ± 8.70    ± 7.19
                     0.00      0.00      0.00      0.00      0.01      0.22      0.00      0.01      0.00      1.00
USPS49     77.70     69.21     84.77     94.96     94.05     68.12     83.43     85.86     95.16     94.75     95.95
                     ± 8.95    ± 1.59    ± 1.44    ± 1.13    ± 8.18    ± 6.66    ± 2.58    ± 0.55    ± 0.61    ± 0.48
                     0.00      0.00      0.03      0.00      0.00      0.00      0.00      0.00      0.00      1.00
UMIST      42.40     36.73     44.46     47.31     48.04     52.80     46.72     50.87     53.52     50.53     54.92
                     ± 1.18    ± 3.26    ± 0.83    ± 1.92    ± 2.26    ± 1.70    ± 1.95    ± 1.54    ± 0.59    ± 1.89
                     0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.01      0.00      1.00
JAFFE      71.57     67.62     73.56     64.79     75.48     74.98     73.93     75.75     77.77     75.46     79.29
                     ± 8.49    ± 4.83    ± 4.08    ± 1.63    ± 2.15    ± 2.85    ± 2.53    ± 1.87    ± 1.61    ± 2.24
                     0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00      1.00
AR         30.26     25.29     29.05     34.22     30.87     32.34     31.06     34.84     34.19     34.12     36.11
                     ± 2.89    ± 1.19    ± 2.70    ± 0.35    ± 1.52    ± 2.14    ± 1.90    ± 2.52    ± 1.60    ± 0.75
                     0.00      0.00      0.05      0.00      0.00      0.00      0.04      0.02      0.00      1.00
COIL       59.17     45.60     51.50     50.84     48.40     52.22     56.94     59.20     59.53     57.96     60.93
                     ± 6.16    ± 5.38    ± 3.76    ± 16.89   ± 6.33    ± 3.43    ± 3.28    ± 4.01    ± 2.27    ± 2.50
                     0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.03      0.00      1.00
LUNG       72.46     58.97     70.42     71.58     65.46     75.52     73.49     77.35     77.86     77.83     81.93
                     ± 5.24    ± 3.41    ± 5.85    ± 3.88    ± 1.57    ± 3.43    ± 2.62    ± 3.12    ± 2.70    ± 1.63
                     0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00      1.00
TOX        43.65     40.25     43.10     39.28     47.14     38.28     39.93     49.17     43.96     47.38     50.12
                     ± 0.65    ± 1.86    ± 0.49    ± 0.75    ± 1.64    ± 1.13    ± 0.83    ± 1.56    ± 1.93    ± 0.67
                     0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00      1.00
Average    58.24     49.43     55.98     57.92     59.29     56.67     59.21     62.20     63.62     62.38     66.15


p] . The weight of the k-nn graph for LapScore and MCFS,
and the pairwise similarity for SPFS and GLSPFS, are based
on the Gaussian kernel, where the kernel width is searched
from the grid $\{2^{-3}, 2^{-2}, \ldots, 2^{3}\}\delta_0$ and
$\delta_0$ is the mean distance between any two data examples.
For GLSPFS, we report the best results among three local
manifold models, namely locality preserving projection (LPP),
LLE and local tangent space alignment (LTSA), as in [17].
For LLCFS, UDFS, NDFS, RUFS, JELSR, GLSPFS and FSASL,
the regularization parameters are searched from the grid
$\{10^{-5}, 10^{-4}, \ldots, 10^{5}\}$. The regularization
parameter $\gamma$ is searched from the grid
$\{0.001, 0.005, 0.01, 0.05, \ldots\}\gamma_{\max}$,
where $\gamma_{\max}$ is automatically computed from SLEP [^. For
FSASL, $\mu$ is determined by Eq. (21) with k = 5 and c is set
to be the true number of classes. To fairly compare different
unsupervised feature selection algorithms, we tune the
parameters for all methods by the grid-search strategy [22].
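
The grids described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the toy points and the exponent ranges passed in are hypothetical placeholders, not the settings or code used in the experiments.

```python
import itertools
import math


def mean_pairwise_distance(points):
    """Mean Euclidean distance over all unordered pairs of points."""
    pairs = list(itertools.combinations(points, 2))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


def kernel_width_grid(points, exponents):
    """Candidate Gaussian kernel widths 2**e * delta0, one per exponent e."""
    delta0 = mean_pairwise_distance(points)
    return [2.0 ** e * delta0 for e in exponents]


def log10_grid(low, high):
    """Regularization grid 10**low, 10**(low + 1), ..., 10**high."""
    return [10.0 ** e for e in range(low, high + 1)]


# Hypothetical toy data: three 2-D points, pairwise distances 5, 10 and 5.
toy_points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
widths = kernel_width_grid(toy_points, range(-1, 2))  # placeholder exponents
reg_values = log10_grid(-2, 2)                        # placeholder range
```

Every candidate value would then be passed to the corresponding method, and the best-performing setting kept, as in an ordinary grid search.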

With the selected features, we evaluate the performance
in terms of k-means clustering by two widely used metrics,
i.e., Accuracy (ACC) and Normalized Mutual Information
(NMI). The results of k-means clustering depend on the
initialization. For all the compared algorithms with different
parameters and different numbers of selected features, we first
repeat the clustering 20 times with random initialization and
record the average results.
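
A minimal sketch of the two metrics and of the averaging over repeated runs is given below. It makes several assumptions that the text does not confirm: accuracy is computed by brute-force search over one-to-one cluster-to-class maps (practical only for a handful of clusters; an assignment solver would normally be used), NMI is normalized by the geometric mean of the two entropies, and `runs` is a hypothetical list of label vectors produced by repeated k-means runs.

```python
import itertools
import math
from collections import Counter


def clustering_accuracy(true_labels, cluster_labels):
    """Best fraction of matches over one-to-one cluster-to-class maps."""
    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_labels))
    counts = Counter(zip(cluster_labels, true_labels))
    # Pad with dummy classes so extra clusters can stay unmatched.
    padded = classes + [None] * max(0, len(clusters) - len(classes))
    best = 0
    for perm in itertools.permutations(padded, len(clusters)):
        hits = sum(counts[(k, c)] for k, c in zip(clusters, perm))
        best = max(best, hits)
    return best / len(true_labels)


def entropy(labels):
    """Shannon entropy (natural log) of a label vector."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def normalized_mutual_information(true_labels, cluster_labels):
    """Mutual information divided by sqrt(H(true) * H(cluster))."""
    n = len(true_labels)
    joint = Counter(zip(true_labels, cluster_labels))
    n_true = Counter(true_labels)
    n_clus = Counter(cluster_labels)
    mi = sum((nij / n) * math.log(nij * n / (n_true[t] * n_clus[k]))
             for (t, k), nij in joint.items())
    denom = math.sqrt(entropy(true_labels) * entropy(cluster_labels))
    return mi / denom if denom > 0 else 0.0


def average_scores(true_labels, runs):
    """Average ACC and NMI over several clustering runs."""
    accs = [clustering_accuracy(true_labels, r) for r in runs]
    nmis = [normalized_mutual_information(true_labels, r) for r in runs]
    return sum(accs) / len(accs), sum(nmis) / len(nmis)
```

Both metrics are invariant to how the clusters are numbered, which is why the accuracy needs the matching step before counting hits.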

5.3 Clustering with Selected Features 

Since the optimal number of selected features is unknown
in advance, to better evaluate the performance of unsupervised
feature selection algorithms, we finally report the averaged
results over different numbers of selected features (the
range of selected features for each data set can be found in
Table ) with standard deviation. For all the algorithms
(except for AllFea), we also report the p-value of the paired
t-test against the best results. The best one and those having
no significant difference (p > 0.05) from the best one are
marked in bold.
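
The aggregation and the paired comparison described above can be sketched as follows. This is an illustration under assumptions, not the authors' evaluation code: the sample standard deviation is assumed, the function names are hypothetical, and only the t statistic and its degrees of freedom are computed. A statistics library (for example `scipy.stats.ttest_rel`) would be needed to turn the statistic into a p-value.

```python
import math
import statistics


def summarize(scores):
    """Mean and sample standard deviation over the feature-count settings."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic and degrees of freedom for two score lists.

    Both lists must be aligned: entry i of each list is the result
    obtained with the same number of selected features.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("all paired differences are identical")
    return statistics.mean(diffs) / (sd / math.sqrt(n)), n - 1
```

Pairing the scores by number of selected features removes the variation that both methods share across feature counts, so the test looks only at the per-setting differences.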

The clustering results in terms of ACC and NMI are reported
in Table 1 and Table 2 respectively. For different
feature selection algorithms, the results in each cell of
Table 1 and Table 2 are the mean ± standard deviation and
the p-value. The last row of Table 1 and Table 2 shows the
averaged results of all the algorithms over the 8 data sets.

Compared with clustering using all features, these unsupervised
feature selection algorithms not only largely reduce the
number of features, which facilitates the subsequent learning
process, but also often improve the clustering performance.
In particular, our method FSASL achieves 13.6%
and 18.1% improvement in terms of accuracy and NMI
respectively with less than 10% of the features. These results
demonstrate the effectiveness and efficiency of unsupervised
feature selection algorithms. It can also be observed
that FSASL consistently produces better performance than
the other nine feature selection algorithms; measured on the
averaged results in the last rows of Table 1 and Table 2, the
improvement is in the range from 3.98% to 33.8% in terms
of clustering accuracy and from 4.77% to 38.5% in terms of
NMI. This can be mainly explained by the following reasons.
First, both global and local structures are used to guide the
search for relevant features. Second, structure learning
and feature selection are integrated into a unified framework.
Third, both the global and local structures can be
adaptively updated using the results of selected features.
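
The improvement over clustering with all features can be checked against the Average rows of Table 1 and Table 2. The short sketch below assumes that "improvement" means the relative gain over the baseline average; the helper name is hypothetical.

```python
def relative_improvement(new, baseline):
    """Relative gain of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


# Average rows: FSASL versus AllFea.
acc_gain = relative_improvement(66.15, 58.24)  # Table 1 (Accuracy)
nmi_gain = relative_improvement(67.41, 57.10)  # Table 2 (NMI)

print(f"ACC gain over AllFea: {acc_gain:.1f}%")
print(f"NMI gain over AllFea: {nmi_gain:.1f}%")
```

Rounded to one decimal place, the two values are 13.6 and 18.1, matching the percentages quoted above.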

5.4 Effect of Adaptive Structure Learning 

















Table 2: Aggregated clustering results measured by Normalized Mutual Information (%) of the compared methods. Each cell gives mean ± standard deviation, with the p-value in parentheses.

| Data Sets | AllFea | LapScore | MCFS | LLCFS | UDFS | NDFS | SPFS | RUFS | JELSR | GLSPFS | FSASL |
|---|---|---|---|---|---|---|---|---|---|---|---|
| MFEA | 70.33 | 53.74 ± 4.77 (0.00) | 54.72 ± 9.14 (0.00) | 52.77 ± 9.76 (0.00) | 54.19 ± 3.83 (0.00) | 64.97 ± 7.54 (0.03) | 64.92 ± 8.27 (0.11) | 63.98 ± 7.22 (0.00) | 64.51 ± 9.07 (0.06) | 59.26 ± 7.59 (0.00) | 66.70 ± 6.71 (1.00) |
| USPS49 | 23.51 | 15.88 ± 17.98 (0.00) | 63.14 ± 1.05 (0.00) | 72.03 ± 5.56 (0.03) | 68.12 ± 4.46 (0.00) | 62.27 ± 9.62 (0.00) | 68.10 ± 16.66 (0.00) | 71.73 ± 7.23 (0.00) | 72.28 ± 2.24 (0.00) | 70.43 ± 2.57 (0.00) | 75.88 ± 2.28 (1.00) |
| UMIST | 64.15 | 55.57 ± 2.32 (0.00) | 63.46 ± 4.93 (0.00) | 63.42 ± 1.42 (0.00) | 65.19 ± 2.96 (0.00) | 71.19 ± 2.77 (0.01) | 64.90 ± 3.06 (0.00) | 68.19 ± 2.61 (0.00) | 71.33 ± 2.06 (0.00) | 69.16 ± 0.97 (0.00) | 72.39 ± 2.39 (1.00) |
| JAFFE | 81.52 | 77.28 ± 8.98 (0.00) | 79.04 ± 5.88 (0.00) | 66.97 ± 3.47 (0.00) | 84.25 ± 1.74 (0.00) | 82.53 ± 3.49 (0.00) | 80.01 ± 3.06 (0.00) | 82.00 ± 3.56 (0.00) | 85.23 ± 3.31 (0.00) | 83.20 ± 3.17 (0.00) | 86.42 ± 3.34 (1.00) |
| AR | 65.48 | 63.59 ± 2.36 (0.00) | 66.41 ± 0.85 (0.00) | 69.01 ± 1.45 (0.01) | 67.49 ± 0.27 (0.00) | 67.89 ± 0.89 (0.00) | 66.94 ± 1.11 (0.00) | 69.54 ± 1.10 (0.01) | 69.02 ± 1.32 (0.00) | 69.44 ± 0.84 (0.00) | 70.78 ± 0.63 (1.00) |
| COIL | 75.58 | 62.21 ± 4.98 (0.00) | 66.19 ± 6.78 (0.00) | 64.04 ± 4.34 (0.00) | 44.27 ± 12.61 (0.00) | 56.29 ± 6.91 (0.00) | 69.91 ± 4.38 (0.00) | 70.54 ± 4.48 (0.00) | 71.37 ± 4.97 (0.00) | 69.89 ± 4.00 (0.00) | 72.93 ± 4.44 (1.00) |
| LUNG | 60.37 | 50.14 ± 4.13 (0.00) | 55.68 ± 2.31 (0.00) | 60.12 ± 4.65 (0.00) | 54.88 ± 4.21 (0.00) | 60.57 ± 1.54 (0.00) | 61.75 ± 3.32 (0.00) | 65.47 ± 1.87 (0.00) | 63.54 ± 2.94 (0.00) | 63.50 ± 2.99 (0.00) | 66.78 ± 1.72 (1.00) |
| TOX | 15.87 | 10.92 ± 0.68 (0.00) | 16.53 ± 2.68 (0.00) | 9.68 ± 0.75 (0.00) | 22.16 ± 1.36 (0.00) | 9.07 ± 1.87 (0.00) | 10.13 ± 1.03 (0.00) | 25.79 ± 1.60 (0.00) | 17.46 ± 3.36 (0.00) | 23.49 ± 2.77 (0.00) | 27.37 ± 1.62 (1.00) |
| Average | 57.10 | 48.67 | 58.14 | 57.26 | 57.56 | 59.35 | 60.83 | 64.65 | 64.34 | 63.55 | 67.41 |


Here, we investigate the effect of adaptive structure learning
by empirically answering the following questions:

1. What kind of structure should be captured and preserved
by the selected features: global, local,
or both?

2. Does adaptive structure learning lead to the selection
of more informative features?



Figure 3: Clustering accuracy w.r.t. 6 different settings of
FSASL on USPS200.

[Plot omitted; x-axis: # of selected features (5 to 50).]

Figure 4: Clustering NMI w.r.t. 6 different settings of
FSASL on USPS200.


We run different settings of FSASL on USPS200, which
consists of the first 100 samples in USPS49. We solve the
optimization problems in Eq. (2), Eq. (4) and Eq. (5),
which use global, local, and both global and local structures,
respectively. We also distinguish these problems with
and without adaptive structure learning. Thus, we have 6
settings in total. Figure 3 and Figure 4 show the results
of these different settings with different numbers of selected
features. The aggregated result over different numbers of
selected features is also provided in Table 4.

Figure 2: Clustering accuracy w.r.t. different parameters on JAFFE (a-c) and TOX (d-f).


Table 4: Aggregated clustering results (%) of 6 different
settings of FSASL on USPS200.

| Problem | Variables | ACC | NMI |
|---|---|---|---|
| Eq. (2) | W | 89.17 ± 3.22 | 52.01 ± 9.69 |
| Eq. (2) | W, S | 91.90 ± 2.51 | 61.95 ± 7.21 |
| Eq. (4) | W | 91.48 ± 2.62 | 59.10 ± 9.31 |
| Eq. (4) | W, P | 92.86 ± 2.53 | 64.65 ± 8.30 |
| Eq. (5) | W | 94.65 ± 1.24 | 69.94 ± 4.22 |
| Eq. (5) | W, S, P | 95.53 ± 1.10 | 74.20 ± 4.83 |


From these results, we can see that: 1) The exploitation
of both global and local structures (i.e., Eq. (5) + W)
outperforms the two alternatives with only global (i.e.,
Eq. (2) + W) or only local (i.e., Eq. (4) + W) structure. It
validates that integrating both global and local structure
is better than using either one alone. 2) The settings that
update the structure (i.e., Eq. (2) + W, S, Eq. (4) + W, P and
Eq. (5) + W, S, P) are better than their respective counterparts
without adaptive structure learning. It shows that
adaptive learning of the global and/or local structure
can further improve the result of feature selection.

5.5 Parameter Sensitivity 

We investigate the sensitivity with respect to the
regularization parameters $\alpha$, $\beta$ and $\gamma$. When we
vary the value of one parameter, we keep the other parameters
fixed at their optimal values. We plot the clustering accuracy
with respect to these parameters on JAFFE and TOX in Figure 2.
The experimental results show that our method is not
very sensitive to $\alpha$, $\beta$ and $\gamma$ over wide ranges.
However, the performance is relatively sensitive to the number
of selected features; choosing this number is still an open problem.
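
The one-parameter-at-a-time protocol used here can be sketched generically. In the sketch below, `evaluate` stands for any routine that returns a clustering score for a given parameter setting; it, the toy scoring function and the grids are hypothetical stand-ins rather than FSASL itself.

```python
def one_at_a_time_sweep(evaluate, best_params, grids):
    """Vary each parameter over its grid while fixing the others.

    evaluate    -- callable taking the parameters as keyword arguments
    best_params -- dict of the (assumed) optimal value of every parameter
    grids       -- dict mapping a parameter name to the values to try
    Returns a dict: parameter name -> list of (value, score) pairs.
    """
    curves = {}
    for name, grid in grids.items():
        curve = []
        for value in grid:
            params = dict(best_params)  # copy so other values stay fixed
            params[name] = value
            curve.append((value, evaluate(**params)))
        curves[name] = curve
    return curves


def toy_score(alpha, beta, gamma):
    """Hypothetical score that peaks when every parameter equals 1."""
    return 1.0 / (1.0 + (alpha - 1) ** 2 + (beta - 1) ** 2 + (gamma - 1) ** 2)


best = {"alpha": 1.0, "beta": 1.0, "gamma": 1.0}
grid = [0.1, 1.0, 10.0]
curves = one_at_a_time_sweep(
    toy_score, best, {"alpha": grid, "beta": grid, "gamma": grid}
)
```

Each entry of `curves` corresponds to one panel of a sensitivity plot: the x-axis is the varied parameter and the y-axis is the resulting score.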

6. CONCLUSION 

In this paper, we proposed a novel unsupervised feature
selection method that simultaneously performs feature selection
and structure learning. In our method, global
structure learning and feature selection are integrated within
the framework of sparse representation, while local structure
learning and feature selection are incorporated into the
probabilistic neighborhood relationship learning framework. By
combining both global and local structure learning with
feature selection, our method can boost both essential
tasks, i.e., structure learning and feature selection,
by using the result of the other task. We derive an efficient
algorithm to optimize the proposed method and discuss the
connections between our method and other feature selection
methods. Extensive experiments have been conducted on
real-world benchmark data sets to demonstrate the superior
performance of our method.

In the future, we plan to further investigate the following
aspects of FSASL. 1) FSASL has three parameters to tune,
which is computationally cumbersome for real applications.
To reduce this burden, we will replace the convex
regularizations on S and W with the $\ell_0$ or $\ell_{2,0}$ norm.
2) FSASL is required to solve an eigen-problem, which is
computationally prohibitive for large scale data. Based on the
connection between spectral clustering and kernel k-means [^, we
will develop an iterative algorithm without eigen-decomposition
and thus make FSASL parallelizable.